Conversation
Hi contributor, thanks for your PR. This patch needs to be approved by one of the admins. They should reply with "/ok-to-test" to accept this PR for running tests automatically.
/run-all-tests
Perhaps use #113 as the base branch to reduce the diff?
Force-pushed from e33b35f to 55753ea
/run-all-tests
Force-pushed from 55753ea to 7f0d002
/run-all-tests
Nitpick: this is not a "dynamic" batch size, just a "non-uniform" batch size. I expected "dynamic" to mean we adjust the batch size at runtime 🙃
Force-pushed from 7f0d002 to 451015d
Force-pushed from 786058a to 9d121ba
/run-all-tests
/run-all-tests
/run-all-tests
* Use the exact result of 1/Beta(N, R) instead of an approximation.
* When the number of engines is small and the total engine size of the first (table-concurrency) batches exceeds the table size, the last batch was truncated, which disrupted the pipeline. Now, in this case, we reduce the batch size to avoid the disruption.
/run-all-tests
PTAL @csuzhangxc @GregoryIan. Since both @lonng and I participated in this PR, we need somebody else to review it 🙃. If approved, this PR will be merged into #113, and then #113 will be merged immediately into master. Summary of the changes:
OK, I will review it tomorrow or the day after tomorrow.
# Lightning will slightly increase the size of the first few batches to properly distribute
# resources. The scale up is controlled by this parameter, which expresses the speed ratio between
# the "import" and "write" steps. If "import" is faster, the batch size anomaly is smaller, and
# zero means uniform batch size. This value should be in the range (0 <= batch-import-ratio < 1).
How do we get the import speed? Should we provide a constant value?
I've expanded the comment a bit. This can be calculated as (import duration / write duration) of a single small table (e.g. ~1 GB).
I do suspect that the ratio is not a constant. It could be affected by the table structure, for instance. But for the 3 tables we've tested, the ratio does approach this value. We could optimize the calculation later.
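(For illustration, a minimal sketch of that calculation, assuming the two durations have already been measured for a small trial table. The variable names and sample timings are hypothetical, not taken from Lightning's code.)

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical timings for a ~1 GB trial table, obtained however is
	// convenient (logs, metrics, or a stopwatch).
	writeDuration := 5 * time.Minute  // encoding + writing KV pairs into the engine
	importDuration := 2 * time.Minute // the importer's "import" step for that engine

	// batch-import-ratio ≈ import duration / write duration.
	ratio := importDuration.Seconds() / writeDuration.Seconds()
	fmt.Printf("suggested batch-import-ratio ≈ %.2f\n", ratio) // 0.40
}
```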
I just mean giving the import duration a reference value. But what is the import duration, and how does a user get it? A user would have to do a trial run and then find it in the log.
Will we give users a series of different recommended values for different deployments later?
We can explain how to choose the best value in the docs and to OPS. But we don't expect users to change these values unless they want to do some heavy optimization.
invBetaNR := math.Exp(logGammaNPlusR - logGammaN - logGammaR) // 1/B(N, R) = Γ(N+R)/Γ(N)Γ(R)
for {
	if n <= 0 || n > tableConcurrency {
		n = tableConcurrency
Should we check whether the user set an unreasonable table concurrency?
In addition, it may be better to compute `n` from the given table concurrency.
We can't directly set n = tableConcurrency, as this may produce an engine size that is too large or too small.
Example of too large: an 8T table with table-conc = 8 forces each batch to be ~1T, putting pressure on the importer.
Example of too small: a 200G table with table-conc = 8 forces each batch to be ~25G, making the data sent to TiKV less sorted and increasing compaction cost.
I mean we compute `n` from the given `batchImportRatio` and `batchSize`. What would happen if `batchSize` is unreasonable (e.g. too small) and the table concurrency is also small? Will the algorithm degenerate to near-ordered import?
// ≲ N/(1-R)
//
// We use a simple brute force search since the search space is extremely small.
ratio := totalDataFileSize * (1 - batchImportRatio) / batchSize
Why not compute N here directly, and just try to reduce the batch size?
Because there's no simple formula to solve for N in X = N - 1/Beta(N, R) 😅.
N is within a limited range of values, which means maybe we can use a heuristic approach.
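(For reference, a minimal self-contained sketch of the brute-force idea: pick the smallest N, bounded by the table concurrency, whose N - 1/Beta(N, R) reaches X = totalDataFileSize * (1 - batchImportRatio) / batchSize. The names `invBeta` and `solveN` and the sample numbers are made up for illustration; the real code in mydump/region clamps and adjusts the result differently.)

```go
package main

import (
	"fmt"
	"math"
)

// invBeta returns 1/B(n, r) = Γ(n+r) / (Γ(n)·Γ(r)), computed through
// log-gamma for numerical stability.
func invBeta(n, r float64) float64 {
	lgNR, _ := math.Lgamma(n + r)
	lgN, _ := math.Lgamma(n)
	lgR, _ := math.Lgamma(r)
	return math.Exp(lgNR - lgN - lgR)
}

// solveN brute-forces the smallest engine count n (2 <= n <= tableConcurrency)
// for which n - 1/B(n, r) reaches the target x = totalSize*(1-r)/batchSize,
// falling back to tableConcurrency when no such n exists.
func solveN(totalSize, batchSize, r float64, tableConcurrency int) int {
	x := totalSize * (1 - r) / batchSize
	for n := 2; n <= tableConcurrency; n++ {
		if float64(n)-invBeta(float64(n), r) >= x {
			return n
		}
	}
	return tableConcurrency
}

func main() {
	// Hypothetical numbers: 400 GiB table, 100 GiB base batch size,
	// batch-import-ratio = 0.75, table-concurrency = 6.
	fmt.Println(solveN(400, 100, 0.75, 6)) // prints 3
}
```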
LGTM
LGTM
/run-all-tests
* config,restore: introduced `[mydumper] batch-size`. Removed `[tikv-importer] batch-size` to avoid confusion. Removed `[mydumper] min-region-size` since it is useless now.
* restore,mydump: pre-allocate engine IDs
* restore: separate table checkpoints and engine checkpoints
* importer: stop exposing the UUID
* checkpoints: make checkpoint diff understand 1 table = many engines
* checkpoints: make file checkpoints recognize multiple engines
* checkpoints: migrated MySQL-based checkpoint to multi-engine as well
* restore: adapt restore workflow for multi-engine
* tests: added test case for multi-engine
* *: fixed code
* *: addressed comments
* *: addressed comments
* Support non-uniform batch size (#114)
* mydump: non-uniform batch size
* *: make the `batch-size-scale` configurable
* *: implemented the optimized non-uniform strategy
* tests: due to change of strategy, checkpoint_engines count becomes 4 again
* mydump/region: slightly adjust the batch size computation
* Use the exact result of 1/Beta(N, R) instead of an approximation
* When the number of engines is small and the total engine size of the first (table-concurrency) batches exceeds the table size, the last batch was truncated, which disrupted the pipeline. Now in this case we reduce the batch size to avoid the disruption.
* restore: log the SQL size and KV size of each engine for debugging
* config: change default batch size and ratio given experiment result
* config: added more explanation about batch-import-ratio

Co-authored-by: Lonng <chris@lonng.org>
What problem does this PR solve?
The import() step will not be concurrent: if the end times of multiple batches are close together, their import steps end up being executed serially.
What is changed and how it works?
curBatchSize = batchSize / int64(batchSizeScale-curEngineID)
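(A minimal sketch of how this scaling formula behaves, assuming batchSizeScale = 5 and a 100 GiB base batch size. The guard for engines beyond batchSizeScale is an assumption added for the example, not taken from the PR.)

```go
package main

import "fmt"

func main() {
	var batchSize int64 = 100 << 30 // 100 GiB base batch size
	batchSizeScale := int64(5)

	for curEngineID := int64(0); curEngineID < 8; curEngineID++ {
		curBatchSize := batchSize
		if curEngineID < batchSizeScale {
			curBatchSize = batchSize / (batchSizeScale - curEngineID)
		}
		fmt.Printf("engine %d: %d GiB\n", curEngineID, curBatchSize>>30)
	}
	// engine 0: 20 GiB, engine 1: 25 GiB, engine 2: 33 GiB, engine 3: 50 GiB,
	// engine 4..7: 100 GiB. Earlier engines finish writing sooner, so their
	// import step starts earlier and overlaps with the writes of later engines.
}
```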